56 ◾ Bioinformatics
transform (BWT), and Full-text Minute-space (FM-index). Aligners implement a variety
of searching algorithms to determine where short reads originated in the indexed refer-
ence genome sequence. Aligning of a read to a genomic location depends on the sequence
similarity. However, a good aligner should expect mismatches due to the sequencing errors
and genetic variation between the reference genome and a sequenced individual. In the
following, we will discuss these commonly used data structures for indexing and popular
searching methods.
2.2.1 Trie
A trie or a prefix tree is a data structure that is used for fast retrieval on large datas-
ets such as looking up sequencing reads. The name was derived by E. Fredkin in 1960
from the phrase “Information Retrieval” by Fredkin (1960) [5]; however, the idea was
first described by the Norwegian mathematician Axel Thue in 1912. A trie is an ordered
tree that can be used to represent a set of strings (reads) over a finite alphabet set, which
is A, C, G, and T for the DNA sequences. It allows reads with common prefixes to use
that prefix and stores only the ends of the reads to indicate different sequences. A trie is
represented by nodes and edges. The root node is empty and then each node in the trie
represents a single character. The maximum number of possible children of a node is four
in case of DNA. The children of a node are ordered alphabetically. The leaf nodes are the
nodes that represent the last characters of reads. A read is represented by the path from
the root node to a leaf node.
The trie data structure, as it is, is not suitable for indexing a genome sequence since
it stores a set of string. However, the generalized idea of the trie is used in the suffix tree
which is widely used to index a reference genome sequence.
2.2.2 Suffix Tree
The suffix tree, as a generalization of the trie data structure, is basically used for pattern
matching and finding substrings in a given string of characters. It is constructed as key-
value pairs (as the python dictionary) where all possible suffixes of the reference sequence
are the keys and positions (indexes) in the sequence as their values. In 1995, Esko Ukkonen
proposed a linear-time algorithm called Ukkonen’s algorithm [6], which constructs a suf-
fix tree in a linear time complexity that the time taken for the indexing increases linearly
with the increase in the number of nodes O(n).
For the sake of simplicity, we will try to show you how to build a suffix tree for a refer-
ence sequence and how mapping of the reads is carried out. Assume that our reference
sequence consists of 10 bases as “CTTGGCTGGA$”, where the positions of the bases
are 0, 1, 2, …, 9 and $ is an empty trailing position. The first step of constructing a suf-
fix tree is by forming suffixes (keys) and indexes (values) pairs from the sequence. We
begin from the last character in the sequence, which is the empty character “$” in the
position 10 in the sequence. Then, we form the suffix “A$”, whose first character is in
position 9 in the sequence. We continue this way until all possible suffixes are created
as shown below: